This tutorial covers the application of a number of exploratory analysis techniques to real-life microarray experiments. We will re-use the colon cancer data in GSE33126, which we saw how to import into R yesterday.
library(GEOquery)
url <- "ftp://ftp.ncbi.nih.gov/pub/geo/DATA/SeriesMatrix/GSE33126/"
filenm <- "data/GSE33126_series_matrix.txt.gz"
if(!file.exists(filenm)) download.file(paste(url, basename(filenm), sep=""), destfile=filenm)
colonData <- getGEO(filename=filenm)
colonData
ExpressionSet (storageMode: lockedEnvironment)
assayData: 48803 features, 18 samples
element names: exprs
protocolData: none
phenoData
sampleNames: GSM820516 GSM820517 ... GSM820533 (18 total)
varLabels: title geo_accession ... data_row_count (31 total)
varMetadata: labelDescription
featureData
featureNames: ILMN_1343291 ILMN_1343295 ... ILMN_2416019 (48803 total)
fvarLabels: ID nuID ... GB_ACC (30 total)
fvarMetadata: Column Description labelDescription
experimentData: use 'experimentData(object)'
Annotation: GPL6947
We will log2-transform the expression data to put them on a more convenient scale for analysis.
exprs(colonData) <- log2(exprs(colonData))
boxplot(exprs(colonData),outline=FALSE)
A first step in the analysis of microarray data is often to remove any uninformative probes. We can do this because typically only around 50% of probes will be expressed, and even fewer than this will be differentially expressed. Including such non-informative genes in a visualisation will obscure any biological differences that exist. As we mentioned yesterday, the genefilter package contains a suite of tools for the filtering of microarray data. The varFilter function allows probes with low variance to be removed from the dataset. The metric used to decide which probes to remove is the Inter-Quartile Range (IQR), and by default half of the probes are removed. Both the function used to do the filtering and the cut-off can be specified by the user.
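The idea behind this kind of filtering can be sketched in a few lines of base R. This is a simplified illustration of IQR-based filtering on random data, not the actual genefilter implementation:

```r
## Simplified sketch of IQR-based filtering (illustration only, random data)
set.seed(42)
exprsMat <- matrix(rnorm(2000), nrow = 200)   # 200 "probes", 10 "samples"

rowIQR   <- apply(exprsMat, 1, IQR)           # variability of each probe
keep     <- rowIQR > median(rowIQR)           # default-style cut-off: the median IQR
filtered <- exprsMat[keep, ]

nrow(filtered)                                # half of the probes are retained
```

As with varFilter, both the summary function (here IQR) and the cut-off could be changed.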
library(genefilter)
dim(colonData)
Features Samples
48803 18
varFiltered <- varFilter(colonData)
dim(varFiltered)
Features Samples
24401 18
nrow(colonData) / nrow(varFiltered)
Features
2.000041
A classic example of clustering applied to expression data is Perou et al., "Molecular portraits of human breast tumours".
When clustering genes, it is common to pre-process the expression values first (for example by filtering and log-transformation, as above). Common similarity / dissimilarity measures include the Euclidean and Manhattan distances, and correlation. The dist function can be used to calculate a variety of distance measures.
See ?dist for the details. We use set.seed here to make sure we always get the same values.
set.seed(10032016)
myMatrix <- matrix(rnorm(1000),ncol=10)
colnames(myMatrix) <- LETTERS[1:10]
head(myMatrix)
A B C D E F G H I J
[1,] 0.4313777 -2.4670200 -0.7073725 0.3194366 0.8602843 -0.7072524 -0.6794123 -1.6310056 0.4481794 -1.2323060
[2,] -0.1119854 0.2155933 1.3840235 -2.1342261 0.4873031 -1.4214807 0.8233594 1.1139286 -0.8939646 1.2021941
[3,] -0.8989685 0.4339697 1.5643489 -0.1795468 0.1206374 0.8601242 0.3049781 -0.1753819 1.5622296 -0.5362093
[4,] 0.6369946 -0.9573872 0.1323738 -0.3081852 1.0382901 -0.8216086 -1.9204444 -1.5448343 0.4831130 -0.9426206
[5,] 0.1357330 0.2440095 1.1538843 -1.0067510 -1.3158878 -0.2761261 -0.5078418 -2.5823836 0.4086202 -2.2609447
[6,] 0.6717109 -1.4595497 -0.2515446 -1.5297978 1.1495169 1.6037638 -0.6668203 1.2564544 2.1929750 -0.7531744
By default, dist works on the rows of a matrix; to compute the distances between the columns A to J we transpose the matrix first.
d <- dist(t(myMatrix))
d
A B C D E F G H I
B 14.08415
C 14.88023 15.01007
D 15.07597 13.81700 13.92782
E 15.30797 15.18065 12.91680 14.36210
F 13.92214 15.01753 14.35741 12.90844 14.22044
G 16.78331 15.87847 14.93457 14.26315 13.65642 14.24959
H 15.18897 14.91864 13.59910 13.75474 13.14450 13.81787 14.26011
I 15.21106 15.15853 13.90017 13.34724 14.63817 12.75963 16.45115 13.91325
J 16.24253 14.95350 14.28564 14.37443 14.50068 15.69133 14.79427 14.30121 15.45298
Other distance measures can be selected using the method argument (see ?dist).
d.man <- dist(t(myMatrix), method="manhattan")
d.man
A B C D E F G H I
B 115.2690
C 122.7021 121.9089
D 122.9448 113.7223 109.2488
E 121.8248 116.7150 104.0412 117.9643
F 111.4721 121.4489 118.7915 105.4497 116.4869
G 133.4619 132.4103 120.1602 110.9659 108.6304 109.9150
H 122.5602 114.8863 110.4001 106.7878 105.2342 114.3228 114.1843
I 122.4814 121.2553 114.1197 101.7003 117.7069 101.9944 131.5409 111.6427
J 131.3290 123.4407 115.3894 114.4036 122.6288 122.0493 122.2218 117.2269 125.1519
We can use the cor function to calculate the correlation between the columns of a matrix (see ?cor). Note that, unlike dist, cor works on the columns rather than the rows.
cor(myMatrix)
A B C D E F G H I J
A 1.00000000 0.1873257371 -0.023000453 -0.08803428 -0.13080526 0.1055754466 -0.18307097 -0.121188905 -0.01338224 -0.14899699
B 0.18732574 1.0000000000 -0.011511807 0.10968627 -0.08226229 0.0004291339 -0.04082896 -0.046354867 0.01698868 0.05761546
C -0.02300045 -0.0115118072 1.000000000 -0.03532202 0.10502192 -0.0409040361 -0.03337509 0.004657798 0.06663211 0.02827576
D -0.08803428 0.1096862726 -0.035322018 1.00000000 -0.16237803 0.1328635922 0.01087487 -0.064794815 0.10070537 -0.02098191
E -0.13080526 -0.0822622909 0.105021917 -0.16237803 1.00000000 -0.0677951324 0.09412331 0.021381549 -0.08673286 -0.04670010
F 0.10557545 0.0004291339 -0.040904036 0.13286359 -0.06779513 1.0000000000 0.08674360 -0.018787424 0.22943055 -0.16930951
G -0.18307097 -0.0408289615 -0.033375088 0.01087487 0.09412331 0.0867435951 1.00000000 0.015941398 -0.19833124 0.05636269
H -0.12118891 -0.0463548665 0.004657798 -0.06479482 0.02138155 -0.0187874244 0.01594140 1.000000000 0.01838101 -0.02447738
I -0.01338224 0.0169886817 0.066632105 0.10070537 -0.08673286 0.2294305467 -0.19833124 0.018381014 1.00000000 -0.07764364
J -0.14899699 0.0576154555 0.028275765 -0.02098191 -0.04670010 -0.1693095113 0.05636269 -0.024477377 -0.07764364 1.00000000
To use correlations for clustering, we must first convert them into a distance object using as.dist. Various as. functions exist to convert between types of data (e.g. as.numeric, as.character etc.), and abs calculates the absolute value.
corMat <- as.dist(1 - abs(cor(myMatrix)))
corMat
A B C D E F G H I
B 0.8126743
C 0.9769995 0.9884882
D 0.9119657 0.8903137 0.9646780
E 0.8691947 0.9177377 0.8949781 0.8376220
F 0.8944246 0.9995709 0.9590960 0.8671364 0.9322049
G 0.8169290 0.9591710 0.9666249 0.9891251 0.9058767 0.9132564
H 0.8788111 0.9536451 0.9953422 0.9352052 0.9786185 0.9812126 0.9840586
I 0.9866178 0.9830113 0.9333679 0.8992946 0.9132671 0.7705695 0.8016688 0.9816190
J 0.8510030 0.9423845 0.9717242 0.9790181 0.9532999 0.8306905 0.9436373 0.9755226 0.9223564
When clustering samples, each entry in the distance matrix should be the pairwise distance between two samples. As the ExpressionSet object has genes in the rows and samples in the columns, we have to transpose the expression matrix.
N.B. to calculate the distances between samples, we have to transpose the expression matrix (e.g. using the function t). If we do not do this, R will try to compute distances between all genes, which may take a long time or exceed the available memory.
euc.dist <- dist(t(exprs(varFiltered)))
euc.dist
GSM820516 GSM820517 GSM820518 GSM820519 GSM820520 GSM820521 GSM820522 GSM820523 GSM820524 GSM820525 GSM820526 GSM820527
GSM820517 146.35076
GSM820518 112.10192 145.53738
GSM820519 127.02787 111.88327 128.89085
GSM820520 103.88617 145.88415 93.26812 124.41043
GSM820521 115.74149 109.59666 112.44568 77.60015 110.62329
GSM820522 86.59678 140.12151 115.68936 131.12234 98.96944 117.26705
GSM820523 118.07552 115.02450 139.86229 101.32561 134.43368 89.05971 105.62731
GSM820524 113.05465 135.64296 106.83091 126.25701 85.61877 117.19595 85.85654 116.10403
GSM820525 142.82777 72.50357 145.43070 101.26515 138.18094 107.93645 127.62848 99.13618 120.46802
GSM820526 87.68842 160.67923 109.56682 126.28151 102.83077 114.59309 107.40340 134.03152 114.20701 154.74502
GSM820527 102.51958 121.92622 119.62366 78.01042 110.89035 69.13310 106.32390 77.38321 115.56748 109.20371 104.05182
GSM820528 85.71746 132.44295 96.04791 115.36869 75.14097 103.00987 89.80380 112.44894 90.45972 124.46834 97.24261 90.59615
GSM820529 107.07671 118.90261 120.08406 94.25470 114.46866 79.01849 116.33941 83.80927 123.55735 107.46869 119.74686 59.39630
GSM820530 80.36987 151.46130 94.29680 123.97811 86.59315 103.45768 97.19378 126.82149 107.83406 146.18764 87.43164 101.35185
GSM820531 108.26678 100.31635 115.03569 82.27479 112.72244 64.87145 116.26826 91.51850 122.61883 103.07141 119.87820 70.55040
GSM820532 128.93571 95.33138 115.04522 106.11806 116.81059 102.43630 124.92807 122.94258 116.57744 103.39879 137.24907 113.49486
GSM820533 106.21944 134.12719 115.78666 97.39843 115.37280 68.75089 116.89886 93.17867 127.75562 130.21971 109.87582 64.16850
GSM820528 GSM820529 GSM820530 GSM820531 GSM820532
GSM820517
GSM820518
GSM820519
GSM820520
GSM820521
GSM820522
GSM820523
GSM820524
GSM820525
GSM820526
GSM820527
GSM820528
GSM820529 86.75871
GSM820530 81.22235 103.68352
GSM820531 96.07646 70.24274 99.39146
GSM820532 105.14902 109.33036 120.61473 93.09175
GSM820533 99.91990 68.56935 96.27636 73.49531 119.19071
For gene-expression data, it is common to use correlation as a distance metric rather than the Euclidean distance. As we saw above, the cor function can be used to calculate the correlation of columns in a matrix. Each row (or column) in the resulting matrix is the correlation of that sample with all other samples in the dataset. The matrix is symmetrical, and we can transform it into a distance matrix by subtracting the correlation matrix from 1. Hence, samples with a higher correlation have a smaller 'distance'.
corMat <- cor(exprs(varFiltered))
corMat
GSM820516 GSM820517 GSM820518 GSM820519 GSM820520 GSM820521 GSM820522 GSM820523 GSM820524 GSM820525 GSM820526 GSM820527
GSM820516 1.0000000 0.8582704 0.9171976 0.8937222 0.9286023 0.9112575 0.9503610 0.9081569 0.9156037 0.8648318 0.9493074 0.9302064
GSM820517 0.8582704 1.0000000 0.8600812 0.9173538 0.8588151 0.9202012 0.8696543 0.9126298 0.8781899 0.9650722 0.8293478 0.9009531
GSM820518 0.9171976 0.8600812 1.0000000 0.8907537 0.9425576 0.9163944 0.9115547 0.8713382 0.9247661 0.8601046 0.9209805 0.9051381
GSM820519 0.8937222 0.9173538 0.8907537 1.0000000 0.8978211 0.9602207 0.8864231 0.9324985 0.8949583 0.9322218 0.8950744 0.9597308
GSM820520 0.9286023 0.8588151 0.9425576 0.8978211 1.0000000 0.9187180 0.9349881 0.8806670 0.9514800 0.8731558 0.9301280 0.9180982
GSM820521 0.9112575 0.9202012 0.9163944 0.9602207 0.9187180 1.0000000 0.9085925 0.9475820 0.9089643 0.9224893 0.9131154 0.9681213
GSM820522 0.9503610 0.8696543 0.9115547 0.8864231 0.9349881 0.9085925 1.0000000 0.9262929 0.9511794 0.8917089 0.9237271 0.9246435
GSM820523 0.9081569 0.9126298 0.8713382 0.9324985 0.8806670 0.9475820 0.9262929 1.0000000 0.9111565 0.9350314 0.8817770 0.9603706
GSM820524 0.9156037 0.8781899 0.9247661 0.8949583 0.9514800 0.9089643 0.9511794 0.9111565 1.0000000 0.9037952 0.9139681 0.9112436
GSM820525 0.8648318 0.9650722 0.8601046 0.9322218 0.8731558 0.9224893 0.8917089 0.9350314 0.9037952 1.0000000 0.8415122 0.9204252
GSM820526 0.9493074 0.8293478 0.9209805 0.8950744 0.9301280 0.9131154 0.9237271 0.8817770 0.9139681 0.8415122 1.0000000 0.9282016
GSM820527 0.9302064 0.9009531 0.9051381 0.9597308 0.9180982 0.9681213 0.9246435 0.9603706 0.9112436 0.9204252 0.9282016 1.0000000
GSM820528 0.9511778 0.8829546 0.9388145 0.9117054 0.9623691 0.9291090 0.9461775 0.9161095 0.9455812 0.8964670 0.9372344 0.9449786
GSM820529 0.9241083 0.9061505 0.9047172 0.9413361 0.9130390 0.9584997 0.9101066 0.9536168 0.8988913 0.9232237 0.9051941 0.9764954
GSM820530 0.9573796 0.8482370 0.9414236 0.8987844 0.9504108 0.9291209 0.9374851 0.8940675 0.9232363 0.8584350 0.9496136 0.9318156
GSM820531 0.9223823 0.9331725 0.9125297 0.9552921 0.9156400 0.9720189 0.9101826 0.9446634 0.9003839 0.9293514 0.9049499 0.9668194
GSM820532 0.8897132 0.9395367 0.9123688 0.9254827 0.9092395 0.9300936 0.8961047 0.8999407 0.9097986 0.9287621 0.8751776 0.9139221
GSM820533 0.9255052 0.8809038 0.9116292 0.9374967 0.9118986 0.9686817 0.9094910 0.9427849 0.8921813 0.8875923 0.9203758 0.9726729
GSM820528 GSM820529 GSM820530 GSM820531 GSM820532 GSM820533
GSM820516 0.9511778 0.9241083 0.9573796 0.9223823 0.8897132 0.9255052
GSM820517 0.8829546 0.9061505 0.8482370 0.9331725 0.9395367 0.8809038
GSM820518 0.9388145 0.9047172 0.9414236 0.9125297 0.9123688 0.9116292
GSM820519 0.9117054 0.9413361 0.8987844 0.9552921 0.9254827 0.9374967
GSM820520 0.9623691 0.9130390 0.9504108 0.9156400 0.9092395 0.9118986
GSM820521 0.9291090 0.9584997 0.9291209 0.9720189 0.9300936 0.9686817
GSM820522 0.9461775 0.9101066 0.9374851 0.9101826 0.8961047 0.9094910
GSM820523 0.9161095 0.9536168 0.8940675 0.9446634 0.8999407 0.9427849
GSM820524 0.9455812 0.8988913 0.9232363 0.9003839 0.9097986 0.8921813
GSM820525 0.8964670 0.9232237 0.8584350 0.9293514 0.9287621 0.8875923
GSM820526 0.9372344 0.9051941 0.9496136 0.9049499 0.8751776 0.9203758
GSM820527 0.9449786 0.9764954 0.9318156 0.9668194 0.9139221 0.9726729
GSM820528 1.0000000 0.9497681 0.9561975 0.9383683 0.9260060 0.9335861
GSM820529 0.9497681 1.0000000 0.9288649 0.9672203 0.9204350 0.9688689
GSM820530 0.9561975 0.9288649 1.0000000 0.9346090 0.9035232 0.9388144
GSM820531 0.9383683 0.9672203 0.9346090 1.0000000 0.9422943 0.9642204
GSM820532 0.9260060 0.9204350 0.9035232 0.9422943 1.0000000 0.9057118
GSM820533 0.9335861 0.9688689 0.9388144 0.9642204 0.9057118 1.0000000
cor.dist <- as.dist(1 - corMat)
The values given by the cor function can be either positive or negative, depending on whether two samples are positively or negatively correlated. However, we may want our distance matrix to contain only values in the range 0 to 1, in which case we need to take the absolute values from the correlation matrix before converting to distances.
cor.dist <- as.dist(1 - abs(corMat))
We are now ready to use a clustering method.
The hclust function performs hierarchical clustering (see ?hclust).
clust <- hclust(d)
clust
Call:
hclust(d = d)
Cluster method : complete
Distance : euclidean
Number of objects: 10
plot(clust)
clust.ward <- hclust(d,method = "ward.D")
par(mfrow=c(1,2))
plot(clust)
plot(clust.ward)
Exercise:
## Your answer here ##
The default plotting for a dendrogram labels the “leaves” with the column names from the input matrix, in our case the sample names from GEO. This may make the interpretation of the dendrogram difficult, as it may not be obvious which sample group each sample belongs to. We can alter the appearance of the dendrogram so that sample groups appear in the labels.
We can use the labels argument to specify a vector that we want to use to label each leaf.
groups <- c(rep("Group1", 5),rep("Group2", 5))
groups
[1] "Group1" "Group1" "Group1" "Group1" "Group1" "Group2" "Group2" "Group2" "Group2" "Group2"
plot(clust,labels=groups)
Exercise:
pd <- pData(colonData)
View(pd)
## Your answer here ##
The WGCNA package in Bioconductor provides methods for finding clusters of correlated genes, which we will not be looking at in this tutorial. However, the package is of interest as it provides other visualisation methods for dendrograms, which allow colours to be overlaid to distinguish sample groups.
library(WGCNA)
We need to create a vector of colours; one for each sample in our dataset. One way is to use the rep function.
groupColours <- rep(c("blue", "yellow"), times = 9)
groupColours
[1] "blue" "yellow" "blue" "yellow" "blue" "yellow" "blue" "yellow" "blue" "yellow" "blue" "yellow" "blue" "yellow"
[15] "blue" "yellow" "blue" "yellow"
Alternatively, we might reset the levels of a factor to be the colours we want.
groupColours <- SampleGroup
levels(groupColours) <- c("yellow","blue")
If we want to interpret the data presented in a clustering analysis, we need a way of extracting which samples are grouped together, or to determine the optimal grouping of samples.
We can use the cutree function to return the labels of samples that belong to the same group when the dendrogram is cut at a specified height, h. Alternatively, we can specify the number of clusters, k, that we want to create. We will also need the cluster package later.
library(cluster)
clust.cor <- hclust(cor.dist)
plot(clust.cor)
abline(h = 0.12, col = "red")
cutree(clust.cor, h = 0.12)
GSM820516 GSM820517 GSM820518 GSM820519 GSM820520 GSM820521 GSM820522 GSM820523 GSM820524 GSM820525 GSM820526 GSM820527 GSM820528
1 2 1 2 1 2 1 2 1 2 1 2 1
GSM820529 GSM820530 GSM820531 GSM820532 GSM820533
2 1 2 2 2
cutree(clust.cor, k = 2)
GSM820516 GSM820517 GSM820518 GSM820519 GSM820520 GSM820521 GSM820522 GSM820523 GSM820524 GSM820525 GSM820526 GSM820527 GSM820528
1 2 1 2 1 2 1 2 1 2 1 2 1
GSM820529 GSM820530 GSM820531 GSM820532 GSM820533
2 1 2 2 2
table(cutree(clust.cor, k = 3), SampleGroup)
SampleGroup
normal tumor
1 0 8
2 2 1
3 7 0
A heatmap is often used to visualise differences between samples. Drawing a heatmap in R uses a lot of memory and can take a long time, so it is usual to restrict the plot to a subset of genes, such as the most variable. The rowSds function from genefilter provides a convenient way of calculating the variability of each gene.
geneVar <- rowSds(exprs(colonData))
sd(exprs(colonData)[1,])
[1] 0.07726564
geneVar[1]
ILMN_1343291
0.07726564
sd(exprs(colonData)[2,])
[1] 0.5612761
geneVar[2]
ILMN_1343295
0.5612761
length(geneVar)
[1] 48803
Next we can select which genes have the highest variance (say the top 100)
highVarGenes <- order(geneVar, decreasing = TRUE)[1:100]
The heatmap function expects a matrix object, so we have to convert the expression values with as.matrix. The labCol argument allows us to label the columns in the plot (the default is to use the column names from the original matrix used for clustering).
heatmap(as.matrix(exprs(colonData)[highVarGenes, ]), labCol = SampleGroup)
Exercise:
From this plot we can already discern patterns in the data.
In a similar way to adding colours to a dendrogram (with plotDendroAndColors), we can add a colour bar underneath the sample dendrogram in the heatmap using the ColSideColors argument.
heatmap(as.matrix(exprs(colonData)[highVarGenes, ]),
        labCol = SampleGroup,
        ColSideColors = as.character(groupColours))
Exercise:
features <- fData(colonData)
View(features)
## Your answer here ##
?heatmap
We can set the Colv argument to NA. This tells the heatmap function not to cluster the columns.
heatmap(as.matrix(exprs(colonData)[highVarGenes, ]),
        labCol = SampleGroup, Colv = NA)
In this plot we set the column order according to the Sample Group
heatmap(as.matrix(exprs(colonData)[highVarGenes, order(SampleGroup)]),
        labCol = SampleGroup[order(SampleGroup)], Colv = NA)
Alternatively, a pre-calculated dendrogram could be used.
clus.ward <- hclust(cor.dist, method = "ward.D")
Note: in recent versions of R the "ward" method has been renamed to "ward.D" (there is also a new "ward.D2").
heatmap(as.matrix(exprs(colonData)[highVarGenes, ]),
        Colv = as.dendrogram(clus.ward), labCol = SampleGroup)
The colours used to display the gene expression values can also be modified. For this, we can use the RColorBrewer package which has functions for creating pre-defined palettes. The function display.brewer.all can be used to display the palettes available through this package.
You should avoid using the traditional red / green colour scheme as it may be difficult for people with colour-blindness to interpret!
library(RColorBrewer)
display.brewer.all()
hmcol <- brewer.pal(11, "RdBu")
heatmap(as.matrix(exprs(colonData)[highVarGenes, ]),
        ColSideColors = as.character(groupColours), col = hmcol)
One drawback of the standard heatmap function is that it only allows one “track” of colours below the dendrogram. We might wish to display various sample groupings using this feature. The heatmap.plus package allows us to do just this.
library(heatmap.plus)
colourMatrix <- matrix(nrow=length(SampleGroup),ncol=2)
Patient <- pd$characteristics_ch1.1
patientCol <- rep(rainbow(n=length(unique(Patient))),each=2)
colourMatrix[,1] <- as.character(groupColours)
colourMatrix[,2] <- patientCol
heatmap.plus(as.matrix(exprs(colonData)[highVarGenes, ]),
        ColSideColors = as.matrix(colourMatrix), col = hmcol)
Another alternative is provided by the gplots package. The heatmap.2 function can be used in the same fashion as heatmap. The plots produced include a colour legend for the cells in the heatmap. By default, a density plot of each column is also produced.
library(gplots)
Attaching package: ‘gplots’
The following object is masked from ‘package:stats’:
lowess
heatmap.2(as.matrix(exprs(colonData)[highVarGenes, ]),
        ColSideColors = as.character(groupColours), col = hmcol)
We can turn off the trace drawn in each column if we wish.
heatmap.2(as.matrix(exprs(colonData)[highVarGenes, ]),
        ColSideColors = as.character(groupColours), col = hmcol, trace = "none")
A Silhouette plot can be used to choose the optimal number of clusters. For each sample, we calculate a value that quantifies how well it ‘fits’ the cluster that it has been assigned to. If the value is around 1, then the sample closely fits other samples in the same cluster. However, if the value is around 0 the sample could belong to another cluster. In the silhouette plot, the values for each cluster are plotted together and ordered from largest to smallest. The number of samples belonging to each group is also displayed.
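The silhouette width itself is simple to compute: for sample i, s(i) = (b - a) / max(a, b), where a is the average distance to the other members of its own cluster and b is the average distance to the members of the nearest other cluster. A minimal base-R check on toy data (an illustration, not the code used by the cluster package):

```r
## Silhouette width for one point, computed from first principles (toy data)
set.seed(1)
x  <- c(rnorm(5, mean = 0), rnorm(5, mean = 5))  # two well-separated clusters
cl <- rep(1:2, each = 5)                         # the cluster assignments
dm <- as.matrix(dist(x))                         # pairwise distances

i <- 1                              # first point, assigned to cluster 1
a <- mean(dm[i, cl == 1][-i])       # average distance within its own cluster
b <- mean(dm[i, cl == 2])           # average distance to the other cluster
s <- (b - a) / max(a, b)
s                                   # close to 1: the point fits its cluster well
```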
The silhouette function is used to calculate this measure, and requires the output from a clustering method and the corresponding distance matrix.
silhouette(cutree(clust.cor, k = 2),cor.dist)
cluster neighbor sil_width
[1,] 1 2 0.3673737
[2,] 2 1 0.4225139
[3,] 1 2 0.3192528
[4,] 2 1 0.4239612
[5,] 1 2 0.4309915
[6,] 2 1 0.3954533
[7,] 1 2 0.3431538
[8,] 2 1 0.3738537
[9,] 1 2 0.3160520
[10,] 2 1 0.4228053
[11,] 1 2 0.3860076
[12,] 2 1 0.3241852
[13,] 1 2 0.3283811
[14,] 2 1 0.3531461
[15,] 1 2 0.4121053
[16,] 2 1 0.4285070
[17,] 2 1 0.2073579
[18,] 2 1 0.2268077
attr(,"Ordered")
[1] FALSE
attr(,"call")
silhouette.default(x = cutree(clust.cor, k = 2), dist = cor.dist)
attr(,"class")
[1] "silhouette"
par(mfrow=c(2,2))
plot(silhouette(cutree(clust.cor, k = 2), cor.dist), col = "red", main = paste("k=", 2))
plot(silhouette(cutree(clust.cor, k = 3), cor.dist), col = "red", main = paste("k=", 3))
plot(silhouette(cutree(clust.cor, k = 4), cor.dist), col = "red", main = paste("k=", 4))
plot(silhouette(cutree(clust.cor, k = 5), cor.dist), col = "red", main = paste("k=", 5))
The pvclust package is also able to provide some assessment of whether the clusters you get are significant or not.
install.packages("pvclust")
library(cluster)
supervised.clus <- pam(euc.dist,k=2)
clusplot(supervised.clus)
supervised.clus$clustering
GSM820516 GSM820517 GSM820518 GSM820519 GSM820520 GSM820521 GSM820522 GSM820523 GSM820524 GSM820525 GSM820526 GSM820527 GSM820528 GSM820529 GSM820530 GSM820531 GSM820532
1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 2
GSM820533
2
table(supervised.clus$clustering,SampleGroup)
SampleGroup
normal tumor
1 0 8
2 9 1
PCA is one of many dimension-reduction techniques that can remove redundancy and give a smaller, more manageable set of variables. In summary, it can help us to:
Compress the data
The data: 86 malt whiskies scored for 12 different taste categories
(https://www.mathstat.strath.ac.uk/outreach/nessie/nessie_whisky.html)
download.file("https://www.mathstat.strath.ac.uk/outreach/nessie/datasets/whiskies.txt", destfile = "whiskies.txt")
trying URL 'https://www.mathstat.strath.ac.uk/outreach/nessie/datasets/whiskies.txt'
Content type 'unknown' length 5510 bytes
==================================================
downloaded 5510 bytes
whisky <- read.csv("whiskies.txt")
Allows us to visualise the dataset in two (perhaps three) dimensions and see the main sources of variation in the data
It is particularly useful to overlay known sample attributes to the plot
So, PCA can be used as a quality assessment tool and can inform us of any factors we need to account for in the analysis
A two-dimensional case
(figure: PCA explained)
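To make the two-dimensional case concrete, here is a toy example (simulated data, not the whisky scores): two strongly-correlated variables, where the first principal component lies along the main direction of variation and captures most of the variance.

```r
## PCA on two correlated variables (toy illustration)
set.seed(123)
x <- rnorm(100)
y <- x + rnorm(100, sd = 0.3)       # y is strongly correlated with x
pcaToy <- prcomp(cbind(x, y))

pcaToy$rotation[, 1]                # PC1 loads roughly equally on x and y
summary(pcaToy)                     # PC1 captures most of the variance
```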
The prcomp function does the hard work for us. Here, scores is a data frame containing the 12 taste scores, with one row per whisky.
pca <- prcomp(scores)
pca
Standard deviations:
[1] 1.5358083 1.2269515 0.8653821 0.8039148 0.7526095 0.6851280 0.6325630 0.5994348 0.5234684 0.5004899 0.4242200 0.2726635
Rotation:
PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9
Body 0.36119005 -0.49130643 0.0301178096 -0.07460952 0.227019231 -0.0790111501 0.029323978 0.51583781 0.13191814
Sweetness -0.20298238 -0.04659634 -0.2638792230 -0.37063897 0.009220720 -0.4914627169 0.433042081 0.16463400 0.43525065
Smoky 0.47794419 -0.06874216 0.2188100647 0.08852143 -0.201614612 -0.0667886441 -0.141399467 0.09637767 0.23285546
Medicinal 0.57527678 0.16079484 0.0431598432 0.08237288 -0.033120119 0.0867251958 -0.003251226 -0.01152970 0.18252089
Tobacco 0.09173306 0.02004776 -0.0006685359 -0.03337611 -0.008967789 -0.0121135184 -0.071870009 -0.09208249 0.15087664
Honey -0.22090804 -0.41799491 0.1102471134 0.03315990 -0.596888644 0.2901798422 0.169648859 0.36435767 -0.22653319
Spicy 0.05811101 -0.17548310 0.6992443906 -0.17163088 -0.133792990 -0.3915377946 0.238734685 -0.40845578 -0.14302248
Winey -0.03745608 -0.63964979 -0.2331295871 -0.22573776 0.110928262 0.0004448821 -0.481915927 -0.41773748 0.04104540
Nutty -0.04766410 -0.26036122 -0.1785529039 0.85059012 0.025288924 -0.3390781376 0.184515722 -0.14496809 0.03499854
Malty -0.12781608 -0.10296202 0.1084169899 0.07177790 -0.105401421 0.5185134071 0.251246664 -0.29837277 0.67782091
Fruity -0.20235755 -0.12374977 0.4034627791 0.09457148 0.702770800 0.2355608908 0.138418605 0.16304325 -0.03098856
Floral -0.38394443 0.13074914 0.3433008365 0.14903423 -0.120141367 -0.2515302423 -0.593377105 0.27961372 0.38441039
PC10 PC11 PC12
Body -0.31837708 -0.422014190 -0.009731685
Sweetness 0.24907687 0.203073487 0.025490489
Smoky -0.21079346 0.731698597 0.051961098
Medicinal 0.71965110 -0.243410656 0.123820632
Tobacco 0.04650015 -0.024513354 -0.975023155
Honey 0.30582604 0.082023515 -0.098150608
Spicy -0.02022326 -0.176878241 0.015678387
Winey 0.21792351 0.097975448 0.079440779
Nutty 0.04292721 -0.008915053 -0.027074653
Malty -0.19435167 -0.138999936 0.086625687
Fruity 0.28253840 0.304329726 -0.059059931
Floral 0.10597374 -0.160632164 0.051366908
names(pca)
[1] "sdev" "rotation" "center" "scale" "x"
summary(pca)
Importance of components:
PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10 PC11 PC12
Standard deviation 1.5358 1.2270 0.8654 0.8039 0.75261 0.68513 0.63256 0.59943 0.52347 0.50049 0.42422 0.27266
Proportion of Variance 0.3011 0.1922 0.0956 0.0825 0.07231 0.05992 0.05108 0.04587 0.03498 0.03198 0.02297 0.00949
Cumulative Proportion 0.3011 0.4933 0.5889 0.6714 0.74370 0.80363 0.85471 0.90058 0.93556 0.96754 0.99051 1.00000
The variable loadings are given by the $rotation matrix, which has one row for each variable (taste category) and one column for each principal component.
pca$rotation
PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9
Body 0.36119005 -0.49130643 0.0301178096 -0.07460952 0.227019231 -0.0790111501 0.029323978 0.51583781 0.13191814
Sweetness -0.20298238 -0.04659634 -0.2638792230 -0.37063897 0.009220720 -0.4914627169 0.433042081 0.16463400 0.43525065
Smoky 0.47794419 -0.06874216 0.2188100647 0.08852143 -0.201614612 -0.0667886441 -0.141399467 0.09637767 0.23285546
Medicinal 0.57527678 0.16079484 0.0431598432 0.08237288 -0.033120119 0.0867251958 -0.003251226 -0.01152970 0.18252089
Tobacco 0.09173306 0.02004776 -0.0006685359 -0.03337611 -0.008967789 -0.0121135184 -0.071870009 -0.09208249 0.15087664
Honey -0.22090804 -0.41799491 0.1102471134 0.03315990 -0.596888644 0.2901798422 0.169648859 0.36435767 -0.22653319
Spicy 0.05811101 -0.17548310 0.6992443906 -0.17163088 -0.133792990 -0.3915377946 0.238734685 -0.40845578 -0.14302248
Winey -0.03745608 -0.63964979 -0.2331295871 -0.22573776 0.110928262 0.0004448821 -0.481915927 -0.41773748 0.04104540
Nutty -0.04766410 -0.26036122 -0.1785529039 0.85059012 0.025288924 -0.3390781376 0.184515722 -0.14496809 0.03499854
Malty -0.12781608 -0.10296202 0.1084169899 0.07177790 -0.105401421 0.5185134071 0.251246664 -0.29837277 0.67782091
Fruity -0.20235755 -0.12374977 0.4034627791 0.09457148 0.702770800 0.2355608908 0.138418605 0.16304325 -0.03098856
Floral -0.38394443 0.13074914 0.3433008365 0.14903423 -0.120141367 -0.2515302423 -0.593377105 0.27961372 0.38441039
PC10 PC11 PC12
Body -0.31837708 -0.422014190 -0.009731685
Sweetness 0.24907687 0.203073487 0.025490489
Smoky -0.21079346 0.731698597 0.051961098
Medicinal 0.71965110 -0.243410656 0.123820632
Tobacco 0.04650015 -0.024513354 -0.975023155
Honey 0.30582604 0.082023515 -0.098150608
Spicy -0.02022326 -0.176878241 0.015678387
Winey 0.21792351 0.097975448 0.079440779
Nutty 0.04292721 -0.008915053 -0.027074653
Malty -0.19435167 -0.138999936 0.086625687
Fruity 0.28253840 0.304329726 -0.059059931
Floral 0.10597374 -0.160632164 0.051366908
We look at the variance explained by each component and judge where it "drops off".
plot(pca)
(figure: scree plot)
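The heights shown in the scree plot are the component variances, pca$sdev^2; the "Proportion of Variance" and "Cumulative Proportion" rows of summary can be recovered from them directly. A self-contained check (random data here, not the whisky scores):

```r
## Recovering the summary() proportions from the standard deviations
set.seed(7)
pcaDemo <- prcomp(matrix(rnorm(120), ncol = 4))   # 30 observations, 4 variables

propVar <- pcaDemo$sdev^2 / sum(pcaDemo$sdev^2)   # proportion of variance per PC
cumVar  <- cumsum(propVar)                        # cumulative proportion

round(rbind(propVar, cumVar), 4)                  # matches the rows of summary(pcaDemo)
```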
The new co-ordinate system is given by pca$x
pca$x[1:3,1:5]
PC1 PC2 PC3 PC4 PC5
Aberfeldy -0.5033841 -1.1220223 -0.1612002 0.5058255 -0.2841501
Aberlour -1.4788883 -3.0048507 1.5170911 -0.1385370 -0.7102894
AnCnoc -1.2531129 0.6537207 -0.2847196 0.9274739 0.1127587
Plotting the first and second columns as a scatter plot gives the plot we saw above
plot(pca$x[,1],pca$x[,2],
pch=16)
The ggplot2 package was actually used to create the plots above.
df <- data.frame(whisky,pca$x)
df
library(ggplot2)
p <- ggplot(df, aes(x=PC1,y=PC2,label=Distillery)) + geom_point() + geom_text(alpha=0.3)
p
p <- ggplot(df, aes(x=PC1,y=PC2,label=Distillery,cex=Smoky,col=as.factor(Floral))) + geom_point() + geom_text(cex=5,alpha=0.3)
p
In the above example, we had several whiskies and had made several observations about each
In the gene expression case, we have biological samples and the measurements we have made are gene expression levels
pca.geneExpression <- prcomp(t(exprs(varFiltered)))
summary(pca.geneExpression)
Importance of components:
PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10 PC11 PC12
Standard deviation 43.059 32.0402 25.9056 20.22396 19.13627 17.02196 15.31561 14.48460 13.96177 12.3029 11.71167 10.47052
Proportion of Variance 0.305 0.1689 0.1104 0.06728 0.06024 0.04766 0.03859 0.03451 0.03207 0.0249 0.02256 0.01803
Cumulative Proportion 0.305 0.4739 0.5843 0.65157 0.71181 0.75947 0.79806 0.83257 0.86464 0.8895 0.91210 0.93014
PC13 PC14 PC15 PC16 PC17 PC18
Standard deviation 10.28270 9.67101 9.2904 8.68515 7.98016 9.635e-14
Proportion of Variance 0.01739 0.01539 0.0142 0.01241 0.01048 0.000e+00
Cumulative Proportion 0.94753 0.96292 0.9771 0.98952 1.00000 1.000e+00
plot(pca.geneExpression)
This time, the $rotation matrix has one row for each gene and one column for each component, while the new sample co-ordinates are given by $x.
head(pca.geneExpression$rotation)
PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
ILMN_1343295 -0.0095862298 -0.0009427855 0.0110801132 -0.0020366425 0.0039612930 -0.0032575067 0.0059828351 0.0067242361
ILMN_1651228 -0.0053335480 -0.0008941161 -0.0009420078 0.0054523781 0.0004756251 -0.0000891311 0.0052662504 -0.0007894543
ILMN_1651229 0.0003598846 -0.0047977498 0.0093582103 0.0015679943 -0.0100423904 0.0005979499 0.0046918432 0.0015929813
ILMN_1651232 0.0001603066 0.0006883039 -0.0031648867 -0.0020969290 -0.0026851196 -0.0023032601 0.0031318975 -0.0024474328
ILMN_1651237 -0.0138286545 0.0045128692 -0.0012337188 0.0188516990 0.0031568146 -0.0122591132 0.0123202116 0.0120437964
ILMN_1651238 0.0013087021 -0.0017765173 -0.0014127659 0.0007139752 -0.0009831901 0.0016670602 0.0006600097 0.0038016640
PC9 PC10 PC11 PC12 PC13 PC14 PC15 PC16
ILMN_1343295 -0.0024826484 -0.0036753738 0.012501992 0.0051956481 0.003450512 0.0044519665 0.004664632 0.001982206
ILMN_1651228 -0.0042419894 -0.0001022168 0.004957987 -0.0037024962 -0.007970069 -0.0029922271 -0.004280022 0.010149830
ILMN_1651229 0.0065968144 0.0076748672 0.007200436 -0.0005963165 -0.005798041 -0.0005614111 0.005856582 0.002390333
ILMN_1651232 -0.0003499680 0.0018844619 0.002411133 0.0046666011 0.004277324 -0.0007469131 0.003443317 0.009219283
ILMN_1651237 -0.0035639146 -0.0078833076 0.002389122 -0.0052620911 0.016837024 -0.0027598384 -0.016956773 -0.006318824
ILMN_1651238 -0.0001027748 0.0024179293 -0.003991155 0.0022744048 -0.002592544 0.0010835169 -0.001669904 -0.001231031
PC17 PC18
ILMN_1343295 -0.0004580298 0.13839882
ILMN_1651228 -0.0034789913 -0.03148518
ILMN_1651229 -0.0008310495 0.27079362
ILMN_1651232 -0.0039999808 0.04115946
ILMN_1651237 0.0072625820 0.21630888
ILMN_1651238 0.0055826678 0.31714586
head(pca.geneExpression$x)
PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10
GSM820516 -46.18825 -8.923357 -22.027635 -45.54360 -5.120973 0.9596011 -2.79250801 1.153655 -20.1787603 9.308557
GSM820517 75.22096 43.006495 6.985413 -28.54978 -5.304069 -7.1109538 -12.08615435 18.117841 11.9347181 13.045094
GSM820518 -42.24422 21.665633 45.344695 11.91361 7.695765 -43.0283554 -25.54010870 -10.291649 -14.5903341 2.048378
GSM820519 40.26441 -18.751675 15.197811 19.74728 -39.920828 12.3203241 9.23518684 -6.282000 -33.8832484 1.534679
GSM820520 -47.57229 26.537251 15.667782 26.27469 7.244000 26.6645067 -0.01388741 23.337475 0.3503335 16.369047
GSM820521 24.30946 -26.944598 16.189943 15.62615 -6.846618 -13.7638018 13.05966507 17.593653 10.3805459 5.867144
PC11 PC12 PC13 PC14 PC15 PC16 PC17 PC18
GSM820516 -20.377564 14.663920 1.799265 -0.3424831 -9.436514 -1.01024129 3.4917132 1.816873e-13
GSM820517 -6.033702 -5.765791 4.914927 -2.6736054 15.860885 -8.65846147 -1.9688994 2.411526e-14
GSM820518 3.518161 -2.011286 -1.611791 2.4299987 -1.029952 0.01001321 1.6112667 9.442100e-15
GSM820519 -2.039604 -2.955841 2.699569 -3.1791492 8.930876 2.00283873 -7.7536103 4.494538e-14
GSM820520 2.374769 11.061875 -17.186329 9.9447343 3.240512 -0.29662191 0.1385593 1.784250e-14
GSM820521 -2.950687 -1.517733 -5.912067 -21.2239810 -19.654984 -6.53524874 -0.5152540 1.676211e-13
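As a reminder of how the pieces of a prcomp object fit together, here is a self-contained sketch on simulated data (the matrix and its dimensions are illustrative, not taken from the colon dataset): the scores in $x are just the centred data projected onto the loadings in $rotation.

```r
# Minimal sketch (simulated data): prcomp() returns scores in $x and
# loadings in $rotation; the scores are the centred data projected
# onto the loadings.
set.seed(1)
mat <- matrix(rnorm(50), nrow = 10, ncol = 5)  # 10 "samples" x 5 "genes"
pca <- prcomp(mat)

# Reconstruct the scores by hand: centre the data, then project
centred <- scale(mat, center = TRUE, scale = FALSE)
scores  <- centred %*% pca$rotation
stopifnot(isTRUE(all.equal(unname(scores), unname(pca$x))))

# Proportion of variance explained by each PC
propVar <- pca$sdev^2 / sum(pca$sdev^2)
round(propVar, 3)
```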
Quite often, we plot the first few PCs against each other to visualise the relationships between samples.
plot(pca.geneExpression$x[,1],pca.geneExpression$x[,2])
We can improve this plot by adding colours for the different sample groups and choosing a different plotting character.
plot(pca.geneExpression$x[,1], pca.geneExpression$x[,2],
pch=16, col=as.character(groupColours))
We might also add a legend to explain the different colours.
plot(pca.geneExpression$x[,1],pca.geneExpression$x[,2],
pch=16,col=as.character(groupColours))
legend("bottomright",fill=c("blue","yellow"),legend=c("tumour","normal"))
text(pca.geneExpression$x[,1],pca.geneExpression$x[,2]-0.01,labels = pd$geo_accession)
Happily, the first component seems to separate the tumours from the normals.
boxplot(pca.geneExpression$x[,1] ~ SampleGroup)
The Bioconductor project has a collection of example datasets. Often these are used to illustrate a particular package or functionality, or to accompany the analysis presented in a publication. For example, several datasets are provided to accompany the genefu package, which has functions useful for the classification of breast cancer patients based on expression profiles. An experimental dataset can be installed and loaded as with any other Bioconductor package. The data itself is saved as an object in the package; you will need to consult the documentation for the package to find the relevant object name. The full list of datasets available through Bioconductor can be found on the Bioconductor website, under the ExperimentData packages.
library(breastCancerVDX)
library(breastCancerTRANSBIG)
data(vdx)
data(transbig)
dim(vdx)
Features Samples
22283 344
dim(transbig)
Features Samples
22283 198
annotation(vdx)
[1] "hgu133a"
annotation(transbig)
[1] "hgu133a"
If we want any classifiers to be reproducible and applicable to other datasets, it is sensible to exclude probes that do not have sufficient annotation from the analysis. For this, we can use the genefilter package as before. The nsFilter function performs this annotation-based filtering as well as variance filtering. The output of the function includes details about how many probes were removed at each stage of the filtering.
library(genefilter)
vdx.filt <- nsFilter(vdx)
vdx.filt
$eset
ExpressionSet (storageMode: lockedEnvironment)
assayData: 6218 features, 344 samples
element names: exprs
protocolData: none
phenoData
sampleNames: VDX_3 VDX_5 ... VDX_2038 (344 total)
varLabels: samplename dataset ... e.os (21 total)
varMetadata: labelDescription
featureData
featureNames: 206797_at 203440_at ... 201130_s_at (6218 total)
fvarLabels: probe Gene.title ... GO.Component.1 (22 total)
fvarMetadata: labelDescription
experimentData: use 'experimentData(object)'
pubMedIds: 17420468
Annotation: hgu133a
$filter.log
$filter.log$numDupsRemoved
[1] 7408
$filter.log$numLowVar
[1] 6219
$filter.log$numRemoved.ENTREZID
[1] 2428
$filter.log$feature.exclude
[1] 10
vdx.filt <- vdx.filt[[1]]
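The variance-filtering step can be sketched in base R on simulated data (illustrative only; nsFilter additionally removes duplicated and poorly annotated probes): rank probes by their Inter-Quartile Range and keep the most variable half.

```r
# Sketch of the variance filter used by nsFilter/varFilter: compute
# the IQR of each probe and keep those above the median IQR.
set.seed(2)
expr <- matrix(rnorm(200 * 10), nrow = 200)   # 200 "probes" x 10 samples
rownames(expr) <- paste0("probe", 1:200)

probeIQR  <- apply(expr, 1, IQR)              # spread of each probe
keep      <- probeIQR > median(probeIQR)      # default cut-off: the median
expr.filt <- expr[keep, ]
nrow(expr.filt)                               # roughly half the probes survive
```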
Format the vdx data for pamr, and train a classifier to predict ER status. For extra clarity in the results, it might be useful to rename the binary er status used in the data package to something more descriptive.
library(pamr)
Loading required package: survival
dat <- exprs(vdx.filt)
gN <- as.character(fData(vdx.filt)$Gene.symbol)
gI <- featureNames(vdx.filt)
sI <- sampleNames(vdx.filt)
erStatus <- pData(vdx)$er
erStatus <- gsub("0", "ER -", erStatus)
erStatus <- gsub("1", "ER +", erStatus)
Fitting the model
train.dat <- list(x = dat, y = erStatus, genenames = gN,
                  geneid = gI, sampleid = sI)
model <- pamr.train(train.dat, n.threshold = 100)
model
Call:
pamr.train(data = train.dat, n.threshold = 100)
threshold nonzero errors
1 0.000 6218 41
2 0.130 5800 41
3 0.260 5362 39
4 0.390 4923 39
5 0.519 4498 39
6 0.649 4154 38
7 0.779 3786 38
8 0.909 3437 38
9 1.039 3141 38
10 1.169 2881 36
11 1.299 2621 34
12 1.428 2374 34
13 1.558 2194 34
14 1.688 1997 34
15 1.818 1820 34
16 1.948 1660 33
17 2.078 1535 33
18 2.208 1403 34
19 2.337 1279 33
20 2.467 1162 32
21 2.597 1065 32
22 2.727 992 32
23 2.857 908 33
24 2.987 821 33
25 3.116 751 33
26 3.246 695 32
27 3.376 641 31
28 3.506 575 32
29 3.636 520 33
30 3.766 458 33
31 3.896 397 33
32 4.025 359 33
33 4.155 318 33
34 4.285 283 33
35 4.415 245 33
36 4.545 218 33
37 4.675 190 33
38 4.805 170 34
39 4.934 155 34
40 5.064 132 34
41 5.194 117 35
42 5.324 103 35
43 5.454 89 36
44 5.584 77 36
45 5.714 68 36
46 5.843 63 36
47 5.973 57 36
48 6.103 54 37
49 6.233 49 37
50 6.363 44 37
51 6.493 41 37
52 6.623 37 37
53 6.752 34 36
54 6.882 32 37
55 7.012 28 37
56 7.142 27 37
57 7.272 25 38
58 7.402 23 39
59 7.531 19 41
60 7.661 17 42
61 7.791 17 42
62 7.921 17 43
63 8.051 17 43
64 8.181 16 43
65 8.311 15 44
66 8.440 13 43
67 8.570 11 48
68 8.700 8 48
69 8.830 7 48
70 8.960 6 51
71 9.090 6 54
72 9.220 6 56
73 9.349 5 57
74 9.479 4 58
75 9.609 4 59
76 9.739 4 63
77 9.869 3 66
78 9.999 3 69
79 10.129 3 73
80 10.258 3 79
81 10.388 3 100
82 10.518 3 118
83 10.648 2 134
84 10.778 2 135
85 10.908 2 135
86 11.038 1 135
87 11.167 1 135
88 11.297 1 135
89 11.427 1 135
90 11.557 1 135
91 11.687 1 135
92 11.817 1 135
93 11.946 1 135
94 12.076 1 135
95 12.206 1 135
96 12.336 1 135
97 12.466 1 135
98 12.596 1 135
99 12.726 1 135
100 12.855 0 135
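To build intuition for what pamr.train is doing, here is a tiny base-R sketch of the plain nearest-centroid idea (omitting the shrinkage that pamr applies; the data are simulated and the group sizes are illustrative): compute a centroid per class, then assign each sample to the class whose centroid is closest.

```r
# Nearest-centroid classification, the idea underlying pamr
# (without centroid shrinkage).
set.seed(3)
x <- cbind(matrix(rnorm(20, mean = 0), nrow = 5),   # 5 genes x 4 "ER -" samples
           matrix(rnorm(20, mean = 2), nrow = 5))   # 5 genes x 4 "ER +" samples
y <- rep(c("ER -", "ER +"), each = 4)

# One centroid (mean profile) per class
centroids <- sapply(split(seq_along(y), y), function(i) rowMeans(x[, i]))

# Assign each sample to the closest centroid (squared Euclidean distance)
pred <- colnames(centroids)[apply(x, 2, function(s)
  which.min(colSums((centroids - s)^2)))]
table(pred, y)
```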
We can perform cross-validation using the pamr.cv function. Printing the output of this function shows a table of how many genes were used at each threshold, and the number of classification errors. Both these values need to be taken into account when choosing a suitable threshold. The pamr.plotcv function can assist with this by producing a diagnostic plot which shows how the error changes with the number of genes. The plot produced by this function has two panels: the top one shows the errors for the whole dataset and the bottom one considers each class separately. In each panel, the x-axis corresponds to the threshold (and the number of genes at each threshold) whereas the y-axis is the number of misclassifications.
model.cv <- pamr.cv(model , train.dat , nfold = 10)
model.cv
Call:
pamr.cv(fit = model, data = train.dat, nfold = 10)
threshold nonzero errors
1 0.000 6218 43
2 0.130 5800 43
3 0.260 5362 43
4 0.390 4923 42
5 0.519 4498 42
6 0.649 4154 42
7 0.779 3786 42
8 0.909 3437 41
9 1.039 3141 41
10 1.169 2881 39
11 1.299 2621 40
12 1.428 2374 39
13 1.558 2194 39
14 1.688 1997 37
15 1.818 1820 37
16 1.948 1660 36
17 2.078 1535 36
18 2.208 1403 38
19 2.337 1279 37
20 2.467 1162 36
21 2.597 1065 35
22 2.727 992 35
23 2.857 908 35
24 2.987 821 34
25 3.116 751 35
26 3.246 695 36
27 3.376 641 36
28 3.506 575 36
29 3.636 520 36
30 3.766 458 35
31 3.896 397 35
32 4.025 359 34
33 4.155 318 33
34 4.285 283 33
35 4.415 245 34
36 4.545 218 34
37 4.675 190 34
38 4.805 170 35
39 4.934 155 35
40 5.064 132 35
41 5.194 117 36
42 5.324 103 36
43 5.454 89 36
44 5.584 77 36
45 5.714 68 36
46 5.843 63 36
47 5.973 57 36
48 6.103 54 37
49 6.233 49 37
50 6.363 44 38
51 6.493 41 38
52 6.623 37 38
53 6.752 34 39
54 6.882 32 39
55 7.012 28 39
56 7.142 27 39
57 7.272 25 40
58 7.402 23 40
59 7.531 19 40
60 7.661 17 42
61 7.791 17 42
62 7.921 17 42
63 8.051 17 43
64 8.181 16 43
65 8.311 15 44
66 8.440 13 45
67 8.570 11 49
68 8.700 8 50
69 8.830 7 53
70 8.960 6 53
71 9.090 6 56
72 9.220 6 56
73 9.349 5 60
74 9.479 4 61
75 9.609 4 64
76 9.739 4 68
77 9.869 3 74
78 9.999 3 78
79 10.129 3 81
80 10.258 3 87
81 10.388 3 99
82 10.518 3 110
83 10.648 2 122
84 10.778 2 131
85 10.908 2 135
86 11.038 1 135
87 11.167 1 135
88 11.297 1 135
89 11.427 1 135
90 11.557 1 135
91 11.687 1 135
92 11.817 1 135
93 11.946 1 135
94 12.076 1 135
95 12.206 1 135
96 12.336 1 135
97 12.466 1 135
98 12.596 1 135
99 12.726 1 135
100 12.855 0 135
pamr.plotcv(model.cv)
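One hedged way to choose Delta programmatically from a table like the one above is to take the largest threshold whose cross-validated error is close to the minimum, since it uses the fewest genes. The numbers below are illustrative, not the real model.cv values:

```r
# Toy cross-validation summary (threshold, genes retained, CV errors)
cv <- data.frame(threshold = c(0, 2, 4, 6, 8, 10),
                 nonzero   = c(6218, 1500, 350, 50, 10, 3),
                 errors    = c(43, 36, 33, 37, 48, 90))

best      <- min(cv$errors)                 # smallest CV error
candidate <- cv[cv$errors <= best + 1, ]    # thresholds near the minimum
Delta     <- max(candidate$threshold)       # prefer fewer genes
Delta
```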
In the following sections, feel free to experiment with different values of the threshold (which we will call Delta). The misclassifications can easily be visualised as a 'confusion table'. This simply tabulates the classes assigned to each sample by the classifier against the original label of the sample: misclassifications are samples that we thought were 'ER+' but have been assigned to the 'ER-' group by the classifier, or 'ER-' samples assigned as 'ER+'.
Delta <- 8
pamr.confusion(model.cv , Delta)
ER - ER + Class Error rate
ER - 106 29 0.21481481
ER + 14 195 0.06698565
Overall error rate= 0.125
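What pamr.confusion reports can be reproduced with base R's table function; a sketch on toy labels (the vectors below are made up for illustration):

```r
# Cross-tabulate predicted against true class and derive error rates.
true <- c("ER -", "ER -", "ER -", "ER +", "ER +", "ER +", "ER +", "ER +")
pred <- c("ER -", "ER +", "ER -", "ER +", "ER +", "ER -", "ER +", "ER +")

confusion <- table(true, pred)
confusion

# Per-class error: off-diagonal proportion of each row
classError <- 1 - diag(confusion) / rowSums(confusion)
round(classError, 3)

# Overall error rate: proportion of all mislabelled samples
overall <- mean(pred != true)
overall
```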
A visual representation of the class separation can be obtained using the pamr.plotcvprob function. For each sample there are two circles representing the probability of that sample being classified ER- (red) or ER+ (green).
pamr.plotcvprob(model, train.dat, Delta)
There are a couple of ways of extracting the details of the genes used in the classifier. We can list their names using the pamr.listgenes function, which in our case just returns the microarray probe IDs. We can, however, use these IDs to query the featureData stored with the original vdx object. We can also plot the expression values for each gene, coloured according to the class label.
pamr.listgenes(model, train.dat, Delta)
id ER --score ER +-score
[1,] 205225_at -0.3257 0.2104
[2,] 209173_at -0.202 0.1305
[3,] 209603_at -0.1753 0.1132
[4,] 204508_s_at -0.1184 0.0764
[5,] 205009_at -0.0987 0.0638
[6,] 214440_at -0.083 0.0536
[7,] 205186_at -0.0583 0.0376
[8,] 219197_s_at -0.0507 0.0327
[9,] 209459_s_at -0.0452 0.0292
[10,] 212956_at -0.0425 0.0274
[11,] 218976_at -0.0402 0.026
[12,] 203929_s_at -0.035 0.0226
[13,] 204623_at -0.0325 0.021
[14,] 215729_s_at 0.0295 -0.0191
[15,] 209800_at 0.0211 -0.0136
[16,] 218211_s_at -0.018 0.0116
[17,] 205862_at -0.0109 0.0071
classifierGenes <- pamr.listgenes(model, train.dat, Delta)[,1]
pamr.geneplot(model, train.dat, Delta)
You may get an error message Error in plot.new(): Figure margins too large when trying to produce the gene plot. If this occurs, try increasing the size of your plotting window, or decrease the number of genes by increasing the threshold. Alternatively, the following code will write the plots to a pdf.
pdf("classifierProfiles.pdf")
for (i in 1:length(classifierGenes)) {
  Symbol <- fData(vdx.filt)[classifierGenes[i], "Gene.symbol"]
  boxplot(exprs(vdx.filt)[classifierGenes[i], ] ~ erStatus,
          main = Symbol)
}
dev.off()
null device
1
Use the genes identified by the classifier to produce a heatmap to confirm that they separate the samples as expected.
symbols <- fData(vdx.filt)[classifierGenes, "Gene.symbol"]
heatmap(exprs(vdx.filt)[classifierGenes, ], labRow = symbols)
We can now test the classifier on an external dataset. We choose the transbig dataset for simplicity, as it was generated on the same microarray platform.
library(breastCancerTRANSBIG)
data(transbig)
pData(transbig)[1:4, ]
transbig.filt <- transbig[featureNames(vdx.filt), ]
predClass <- pamr.predict(model, exprs(transbig.filt), Delta)
table(predClass, pData(transbig)$er)
predClass 0 1
ER - 52 11
ER + 12 123
boxplot(pamr.predict(model, exprs(transbig.filt), Delta,
        type = "posterior")[, 1] ~ pData(transbig)$er)
Make a heatmap of the transbig data using the genes involved in the vdx classifier.
erLab <- as.factor(pData(transbig)$er)
levels(erLab) <- c("blue", "yellow")
heatmap(exprs(transbig.filt)[classifierGenes, ], labRow = symbols,
        ColSideColors = as.character(erLab))
An attractive feature of the vdx dataset is that it includes survival data for each breast cancer patient. We are not explicitly covering survival analysis in this course, but for your reference, here are the commands to create survival curves when patients are grouped by ER status and tumour grade.
library(survival)
par(mfrow = c(1, 2))
plot(survfit(Surv(pData(vdx)$t.dmfs, pData(vdx)$e.dmfs) ~
     pData(vdx)$er), col = c("cyan", "salmon"))
plot(survfit(Surv(pData(vdx)$t.dmfs, pData(vdx)$e.dmfs) ~
     pData(vdx)$grade), col = c("blue", "yellow", "orange"))
survdiff(Surv(pData(vdx)$t.dmfs, pData(vdx)$e.dmfs) ~
     pData(vdx)$er)
Call:
survdiff(formula = Surv(pData(vdx)$t.dmfs, pData(vdx)$e.dmfs) ~
pData(vdx)$er)
N Observed Expected (O-E)^2/E (O-E)^2/V
pData(vdx)$er=0 135 38 45.5 1.244 2.04
pData(vdx)$er=1 209 80 72.5 0.781 2.04
Chisq= 2 on 1 degrees of freedom, p= 0.154
survdiff(Surv(pData(vdx)$t.dmfs, pData(vdx)$e.dmfs) ~
     pData(vdx)$grade)
Call:
survdiff(formula = Surv(pData(vdx)$t.dmfs, pData(vdx)$e.dmfs) ~
pData(vdx)$grade)
n=197, 147 observations deleted due to missingness.
N Observed Expected (O-E)^2/E (O-E)^2/V
pData(vdx)$grade=1 7 0 3.06 3.06 3.21
pData(vdx)$grade=2 42 10 16.69 2.68 3.53
pData(vdx)$grade=3 148 61 51.26 1.85 6.72
Chisq= 7.7 on 2 degrees of freedom, p= 0.0218
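For further reference, the same survfit/survdiff recipe can be run on self-contained simulated data (the rates, censoring probability, and group sizes below are all illustrative; group 2 is simulated to fail faster):

```r
# Simulated two-group survival comparison with a log-rank test.
library(survival)
set.seed(4)
time  <- c(rexp(50, rate = 0.1), rexp(50, rate = 0.3))  # group 2 fails faster
event <- rbinom(100, 1, 0.8)                            # some censoring
group <- rep(c(1, 2), each = 50)

fit <- survfit(Surv(time, event) ~ group)
plot(fit, col = c("cyan", "salmon"))
survdiff(Surv(time, event) ~ group)
```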